ENVX1002 Introduction to Statistical Methods
The University of Sydney
Feb 2025
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
Ideal for predicting a continuous response variable from a single predictor variable: “How does y change as x changes, when the relationship is linear?”
Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_k x_{ki} + \epsilon_i
“How does y change as x_1, x_2, …, x_k change?”
Y_i = f(x_i, \beta) + \epsilon_i
where f(x_i, \beta) is a nonlinear function of the parameters \beta: “How do we model a change in y with x when the relationship is nonlinear?”
Carl Friedrich Gauss (1777–1855) and Isaac Newton (1642–1726): the Gauss-Newton approach is the most commonly used method for fitting nonlinear regression models.
Linear relationships are simple to interpret since the rate of change is constant.
“As one changes, the other changes at a constant rate.”
Nonlinear relationships often involve exponential, logarithmic, or power functions.
“As one changes, the other changes at a rate that is not proportional to the change in the other.”
In a relationship where y is a function of x^a, the rate at which y changes depends on the current value of x, rather than being constant.
Interpretation:
| Exponents | Logarithms | |
|---|---|---|
| Definition | If a^n = b, a is the base, n is the exponent, and b is the result. | If \log_a b = n, a is the base, b is the result, and n is the logarithm (or the exponent in the equivalent exponential form). |
| Example | 2^3 = 8 | \log_2 8 = 3 |
| Interpretation | 2 raised to the power of 3 equals 8. | The power to which you must raise 2 to get 8 is 3. |
| Inverse | The logarithm is the inverse operation of exponentiation. | The exponentiation is the inverse operation of logarithm. |
| Properties | (a^n)^m = a^{n \cdot m}, a^n \cdot a^m = a^{n+m}, \frac{a^n}{a^m} = a^{n-m} | \log_a(b \cdot c) = \log_a b + \log_a c, \log_a\left(\frac{b}{c}\right) = \log_a b - \log_a c, \log_a(b^n) = n \cdot \log_a b |
Note
This is for your understanding; you don't need to memorise it.
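A quick numerical check of the identities in the table, using only base R (the values 2, 3, 4 and 8 are arbitrary examples):

```r
a <- 2; n <- 3; m <- 4
# Exponent rules
isTRUE(all.equal((a^n)^m, a^(n * m)))    # TRUE
isTRUE(all.equal(a^n * a^m, a^(n + m)))  # TRUE
# Logarithm rules (log of a product, log of a power)
isTRUE(all.equal(log(8 * 4, base = 2),
                 log(8, base = 2) + log(4, base = 2)))       # TRUE
isTRUE(all.equal(log(8^n, base = 2), n * log(8, base = 2)))  # TRUE
```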
Often, a nonlinear relationship may be transformed into a linear relationship by applying a transformation to the response variable or the predictor variable(s).
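For example, an exponential decay y = a e^{-bx} becomes a straight line after log-transforming the response, so it can be fitted with lm(). A sketch with assumed parameter values (a = 5, b = 0.4, multiplicative noise):

```r
set.seed(1)
x <- seq(0.5, 10, by = 0.5)
# Exponential decay with multiplicative noise: y = a * e^(-b x)
y <- 5 * exp(-0.4 * x) * exp(rnorm(length(x), sd = 0.1))
# Taking logs linearises it: log(y) = log(a) - b * x
fit <- lm(log(y) ~ x)
coef(fit)  # intercept near log(5), slope near -0.4
```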
f(x_i, \beta)
Response variable decreases and approaches limit as predictor variable increases.
y = a \cdot e^{-bx}
Examples: radioactive decay, population decline, chemical reactions.
Response variable increases and approaches a limit as the predictor variable increases.
y = a + b(1 - e^{-cx})
Examples: population growth, enzyme kinetics.
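The asymptotic data set used in later chunks is not created in this excerpt; a sketch of how such data could be simulated (the seed and the parameter values a = -10, b = 110, c = 0.6 are assumptions, chosen to resemble the fitted values shown later):

```r
library(tibble)
set.seed(360)
asymptotic <- tibble(
  predictor = seq(0, 10, by = 0.2),
  # y = a + b(1 - e^{-c x}) plus noise; parameter values are illustrative
  response = -10 + 110 * (1 - exp(-0.6 * predictor)) +
    rnorm(length(predictor), mean = 0, sd = 10)
)
```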
An S-shaped relationship, where the response variable is at first exponential, then asymptotic.
y = c + \frac{d-c}{1+e^{-b(x-a)}}
set.seed(450)
# Simulate data:
logistic <- tibble(predictor = seq(0, 10, by = 0.2),
response = 10 + abs(300 * (1 / (1 + exp(-0.8 * (predictor - 5)))) + rnorm(length(predictor), mean = 0, sd = 10)))
ggplot(data = logistic, aes(x = predictor, y = response)) +
geom_point() +
labs(x = "Predictor", y = "Response")

Examples: growth of bacteria, disease spread, species growth.
Response variable changes in a variety of ways as the predictor variable changes. Also known as ‘curvilinear’.
y = a + bx + cx^2 + dx^3 + ...
# Set seed for reproducibility
set.seed(529)
# Simulate data:
curvilinear <- tibble(predictor = seq(0, 30, length.out = 50),
response = 50 * (1 - (predictor - 15)^2 / 225) + rnorm(length(predictor), mean = 0, sd = 5))
ggplot(data = curvilinear, aes(x = predictor, y = response)) +
geom_point() +
labs(x = "Predictor", y = "Response")

Examples: food intake, drug dosage, exercise.
How far can we go?
We can use the nls() function to fit the following nonlinear models:
A special case of multiple linear regression used to model nonlinear relationships.
Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + ... + \beta_k x_i^k + \epsilon_i
where k is the degree of the polynomial.
These polynomial models can be fitted with lm(). Using the asymptotic data:
See Slide 11 for the relationship and mathematical expression.
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i
poly(degree = 2): Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i
poly(degree = 3): Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i
poly(degree = 10): Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + ... + \beta_{10} x_i^{10} + \epsilon_i
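The model objects lin_fit, poly2_fit, poly3_fit and poly10_fit used in the following chunks are not defined in this excerpt. A sketch of how they can be created with lm() and poly() (the asymptotic data is simulated here with assumed parameter values so the chunk runs on its own):

```r
set.seed(360)
# Stand-in for the slides' asymptotic data so this chunk is self-contained
asymptotic <- data.frame(predictor = seq(0, 10, by = 0.2))
asymptotic$response <- -10 + 110 * (1 - exp(-0.6 * asymptotic$predictor)) +
  rnorm(nrow(asymptotic), mean = 0, sd = 10)

# Orthogonal polynomials of increasing degree, all fitted with lm()
lin_fit    <- lm(response ~ predictor, data = asymptotic)
poly2_fit  <- lm(response ~ poly(predictor, 2), data = asymptotic)
poly3_fit  <- lm(response ~ poly(predictor, 3), data = asymptotic)
poly10_fit <- lm(response ~ poly(predictor, 10), data = asymptotic)
```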
ggplot(asymptotic, aes(x = predictor, y = response)) +
geom_point() +
labs(x = "Predictor", y = "Response") +
geom_line(aes(y = predict(lin_fit)), color = "red") +
geom_line(aes(y = predict(poly2_fit)), color = "slateblue") +
geom_line(aes(y = predict(poly3_fit)), color = "seagreen") +
  geom_line(aes(y = predict(poly10_fit)), color = "firebrick", linewidth = 2)

# 1) Compute R^2 values
R2_lin <- summary(lin_fit)$r.squared
R2_poly2 <- summary(poly2_fit)$adj.r.squared
R2_poly3 <- summary(poly3_fit)$adj.r.squared
R2_poly10 <- summary(poly10_fit)$adj.r.squared
# 2) Create a simple table (data frame)
model_r2_table <- data.frame(
Model = c("Linear", "Poly2", "Poly3", "Poly10"),
R2 = c(R2_lin, R2_poly2, R2_poly3, R2_poly10)
)
# Display the table
knitr::kable(
model_r2_table,
digits = 3,
caption = "Comparison of R^2^ of Polynomial Models"
)

| Model | R2 |
|---|---|
| Linear | 0.570 |
| Poly2 | 0.820 |
| Poly3 | 0.872 |
| Poly10 | 0.862 |
Note
We use the adjusted R2 for polynomials: extra terms mean extra complexity, so they attract an extra penalty.
Call:
lm(formula = response ~ poly(predictor, 10), data = asymptotic)
Residuals:
Min 1Q Median 3Q Max
-17.1659 -8.6908 -0.0494 8.8003 16.4012
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.818 1.552 51.426 < 2e-16 ***
poly(predictor, 10)1 159.368 11.084 14.378 < 2e-16 ***
poly(predictor, 10)2 -106.939 11.084 -9.648 5.37e-12 ***
poly(predictor, 10)3 48.570 11.084 4.382 8.28e-05 ***
poly(predictor, 10)4 -19.411 11.084 -1.751 0.0876 .
poly(predictor, 10)5 1.193 11.084 0.108 0.9148
poly(predictor, 10)6 -2.769 11.084 -0.250 0.8040
poly(predictor, 10)7 -1.343 11.084 -0.121 0.9042
poly(predictor, 10)8 -4.009 11.084 -0.362 0.7195
poly(predictor, 10)9 -2.851 11.084 -0.257 0.7984
poly(predictor, 10)10 5.769 11.084 0.520 0.6056
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.08 on 40 degrees of freedom
Multiple R-squared: 0.8897, Adjusted R-squared: 0.8621
F-statistic: 32.26 on 10 and 40 DF, p-value: 4.846e-16
Polynomials are fitted with lm(). If you have some understanding of the underlying relationship (e.g. a mechanistic process) between the variables, you can fit a nonlinear model instead.
Y_i = f(x_i, \beta) + \epsilon_i
where f(x_i, \beta) is a nonlinear function of the parameters \beta.
Like the linear model, the nonlinear model assumes INE: independent, normally distributed errors with equal variance.
Basically:
\epsilon_i \sim N(0, \sigma^2)
Like all other models we have seen, we focus on the residuals to assess the model fit, since the residuals are the only part of the model that is random.
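A sketch of those residual checks for a fitted nls() model (data simulated with assumed parameter values so the chunk runs on its own):

```r
set.seed(360)
x <- seq(0, 10, by = 0.2)
y <- -10 + 110 * (1 - exp(-0.6 * x)) + rnorm(length(x), mean = 0, sd = 10)
fit <- nls(y ~ a + b * (1 - exp(-c * x)), start = list(a = 0, b = 100, c = 1))
r <- residuals(fit)
# Residuals vs fitted: we want no pattern (independence, equal variance)
plot(fitted(fit), r, xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2)
# QQ plot: we want points close to the line (normality)
qqnorm(r); qqline(r)
```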
Source: Wikipedia
We use the nls() function in R. Its main arguments are:

- formula: a formula object, with the response variable on the left of a ~ operator and the predictor variable(s) on the right.
- data: a data frame containing the variables in the model.
- start: a named list of starting values for the parameters in the model.

y = a + b(1 - e^{-cx})
ggplot(data = asymptotic, aes(x = predictor, y = response)) +
geom_point() +
geom_hline(yintercept = 100, linetype = "dashed") +
geom_vline(xintercept = 0, linetype = "dashed") +
  ## plot the rate; annotate() avoids the aes-recycling warning from geom_segment()
  annotate("segment", x = 0, y = 0, xend = 2.5, yend = 100,
           arrow = arrow(length = unit(0.5, "cm")),
           color = "red") +
  labs(x = "Predictor", y = "Response")
y = a + b(1 - e^{-cx})
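The nls() summary shown next comes from a call like the following sketch. The starting values are rough guesses read off the plot (a near the value at x = 0, b near the rise to the plateau, c near the rate), and the data are simulated here with assumed parameters so the chunk runs on its own, so exact numbers will differ:

```r
set.seed(360)
asymptotic <- data.frame(predictor = seq(0, 10, by = 0.2))
asymptotic$response <- -10 + 110 * (1 - exp(-0.6 * asymptotic$predictor)) +
  rnorm(nrow(asymptotic), mean = 0, sd = 10)
# Fit y = a + b(1 - e^{-cx}) with guessed starting values
asy_fit <- nls(response ~ a + b * (1 - exp(-c * predictor)),
               data = asymptotic,
               start = list(a = 0, b = 100, c = 1))
coef(asy_fit)
```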
Formula: response ~ a + b * (1 - exp(-c * predictor))
Parameters:
Estimate Std. Error t value Pr(>|t|)
a -14.51755 6.64162 -2.186 0.0337 *
b 113.03797 6.44716 17.533 < 2e-16 ***
c 0.62962 0.07142 8.816 1.32e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.21 on 48 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 1.178e-06
y = -14.5 + 113.04(1 - e^{-0.63x})

The R-squared value is not reported for nonlinear models because the total sum of squares is not partitioned into explained and unexplained components. Use the residual standard error and residual plots instead to compare models.
What if we don't estimate our starting parameters very well? R will either give an error or get there eventually, after more iterations.
Formula: response ~ a + b * (1 - exp(-c * predictor))
Parameters:
Estimate Std. Error t value Pr(>|t|)
a -14.51760 6.64163 -2.186 0.0337 *
b 113.03800 6.44717 17.533 < 2e-16 ***
c 0.62962 0.07142 8.816 1.32e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.21 on 48 degrees of freedom
Number of iterations to convergence: 7
Achieved convergence tolerance: 2.967e-07
Tip
If an error pops up, try different starting values; the rate of change is the most likely culprit.
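One way to probe starting values without crashing a script is to wrap nls() in tryCatch() (a sketch; try_fit() is a hypothetical helper and the start values are deliberate guesses):

```r
set.seed(360)
x <- seq(0, 10, by = 0.2)
y <- -10 + 110 * (1 - exp(-0.6 * x)) + rnorm(length(x), mean = 0, sd = 10)

# Hypothetical helper: fit with a given starting rate, returning the
# error message instead of stopping if the fit fails
try_fit <- function(c_start) {
  tryCatch(
    nls(y ~ a + b * (1 - exp(-c * x)),
        start = list(a = 0, b = 100, c = c_start)),
    error = function(e) conditionMessage(e)
  )
}

fit_ok  <- try_fit(1)   # a sensible guess converges quickly
fit_far <- try_fit(10)  # a poor guess may take more iterations, or fail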
y = c + \frac{d-c}{1+e^{-b(x-a)}}
S or sigmoid-shaped curve
y = c + \frac{d-c}{1+e^{-b(x-a)}}
ggplot(data = logistic, aes(x = predictor, y = response)) +
geom_point() +
geom_smooth() +
labs(x = "Predictor", y = "Response") +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_hline(yintercept = 300, linetype = "dashed") +
geom_vline(xintercept = 5, linetype = "dashed") +
# label the lines above
annotate("text", x = 9, y = 0, label = "c", size = 8, vjust = -1) +
annotate("text", x = 0, y = 300, label = "d", size = 8, vjust = 1.5) +
annotate("text", x = 5, y = 100, label = "a", size = 8, hjust = -1) +
  ## plot the rate; annotate() avoids the aes-recycling warning from geom_segment()
  annotate("segment", x = 2.5, y = 60, xend = 6, yend = 250,
           arrow = arrow(length = unit(0.5, "cm")),
           color = "red") +
  # label the rate
  annotate("text", x = 4, y = 180, label = "b", size = 8, colour = "red", hjust = -1)
y = c + \frac{d-c}{1+e^{-b(x-a)}}
Note
While it can be done this way, it is not the easiest way to do it.
The nls() function requires a formula and starting parameters. R's self-starting functions (SSasymp(), SSlogis(), SSexp, etc.) estimate the starting values for you.

Important
We still need to have some understanding of the underlying relationship between the variables to pick the right function.
y = c + \frac{d-c}{1+e^{-b(x-a)}}
A slightly different equation: a logistic growth curve whose lowest possible value is 0 (c = 0).
y = \frac{Asym}{1 + \exp\left(\frac{xmid - input}{scal}\right)}
- input: the predictor variable.
- Asym: the upper limit.
- xmid: the value of x when y is halfway between the lower and upper limits.
- scal: the rate of change.

SSlogis() guessed the parameters on the first try.
Formula: response ~ c + (d - c)/(1 + exp(-b * (predictor - a)))
Parameters:
Estimate Std. Error t value Pr(>|t|)
c 12.41294 4.30916 2.881 0.00596 **
d 304.72555 4.32716 70.422 < 2e-16 ***
b 0.83920 0.04914 17.076 < 2e-16 ***
a 5.00741 0.06806 73.570 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.64 on 47 degrees of freedom
Number of iterations to convergence: 6
Achieved convergence tolerance: 8.471e-07
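A sketch of the SSlogis() call, on data simulated like the logistic example earlier in this deck; note that no start list is needed:

```r
set.seed(450)
logistic <- data.frame(predictor = seq(0, 10, by = 0.2))
logistic$response <- 10 + 300 / (1 + exp(-0.8 * (logistic$predictor - 5))) +
  rnorm(nrow(logistic), mean = 0, sd = 10)
# Self-starting logistic: nls() estimates its own starting values
log_fit <- nls(response ~ SSlogis(predictor, Asym, xmid, scal), data = logistic)
coef(log_fit)  # Asym: upper limit; xmid: midpoint; scal is the inverse rate (1/b)
```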
The self-starting function for the asymptotic model is SSasymp().
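A sketch of the equivalent SSasymp() call (data simulated with assumed parameters to resemble the asymptotic example). SSasymp() uses a different parameterisation, y = Asym + (R0 - Asym) e^{-e^{lrc} x}, so in the a + b(1 - e^{-cx}) form: a = R0, b = Asym - R0, c = exp(lrc):

```r
set.seed(360)
asymptotic <- data.frame(predictor = seq(0, 10, by = 0.2))
asymptotic$response <- -10 + 110 * (1 - exp(-0.6 * asymptotic$predictor)) +
  rnorm(nrow(asymptotic), mean = 0, sd = 10)
# Self-starting asymptotic regression: no start list needed
ss_fit <- nls(response ~ SSasymp(predictor, Asym, R0, lrc), data = asymptotic)
exp(coef(ss_fit)[["lrc"]])  # back-transform lrc to the rate c
```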
Comparing outputs:
In some cases, the fits are identical, but in others, they are not.
Note: this is non-examinable content but might be useful for your project.
library(tidyr)
# Create a new data frame with predictor values and model predictions
predictions <- data.frame(
predictor = asymptotic$predictor,
Linear = predict(lin_fit),
Poly_2 = predict(poly2_fit),
Poly_3 = predict(poly3_fit),
Poly_10 = predict(poly10_fit)
)
# Reshape the data to long format
predictions_long <- predictions %>%
pivot_longer(cols = -predictor, names_to = "Model", values_to = "response")
# Plot the data
ggplot(predictions_long, aes(x = predictor, y = response, color = Model)) +
geom_point(data = asymptotic, aes(x = predictor, y = response), inherit.aes = FALSE) +
geom_line(linewidth = 1) +
labs(x = "Predictor", y = "Response") +
  scale_color_brewer(palette = "Spectral")

We can use prediction quality metrics to compare the fits.
Use the broom package to extract the AIC and BIC values from the model fits.
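The AIC/BIC table shown next comes from broom::glance(); a base-R sketch that produces equivalent values (data and fits are simulated here with assumed parameters, so the numbers differ from the slides):

```r
set.seed(360)
asymptotic <- data.frame(predictor = seq(0, 10, by = 0.2))
asymptotic$response <- -10 + 110 * (1 - exp(-0.6 * asymptotic$predictor)) +
  rnorm(nrow(asymptotic), mean = 0, sd = 10)
fits <- list(linear = lm(response ~ predictor, data = asymptotic),
             poly2  = lm(response ~ poly(predictor, 2), data = asymptotic),
             poly3  = lm(response ~ poly(predictor, 3), data = asymptotic),
             poly10 = lm(response ~ poly(predictor, 10), data = asymptotic))
ic <- data.frame(model = names(fits),
                 AIC = sapply(fits, AIC),
                 BIC = sapply(fits, BIC),
                 row.names = NULL)
ic  # lower AIC/BIC means a better trade-off between fit and complexity
```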
# A tibble: 4 × 3
model AIC BIC
<chr> <dbl> <dbl>
1 linear 453. 459.
2 poly2 409. 416.
3 poly3 392. 402.
4 poly10 402. 425.
predictions <- data.frame(
observed = asymptotic$response,
Linear = predict(lin_fit),
Poly_2 = predict(poly2_fit),
Poly_3 = predict(poly3_fit),
Poly_10 = predict(poly10_fit)
)
errors <- predictions %>%
pivot_longer(cols = -observed, names_to = "Model", values_to = "Predicted") %>%
group_by(Model) %>%
summarise(
RMSE = sqrt(mean((observed - Predicted)^2)),
MAE = mean(abs(observed - Predicted))
)
knitr::kable(errors, digits = 2, caption = "Comparison of RMSE and MAE for different models")

| Model | RMSE | MAE |
|---|---|---|
| Linear | 19.38 | 15.17 |
| Poly_10 | 9.82 | 8.57 |
| Poly_2 | 12.30 | 9.88 |
| Poly_3 | 10.25 | 8.83 |
Note
Both the RMSE and MAE measure error on the same scale as the response variable. e.g. if the response variable is in kg, the error will be in kg.
This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.